[python] New embeddings API #1023

ebezzi · 2024-02-28T18:49:38Z

New embeddings API. Provides a unified access pattern to embeddings (regardless of whether they're collaboration or hosted) through get_anndata.

api/python/cellxgene_census/src/cellxgene_census/_get_anndata.py

ebezzi · 2024-02-28T18:53:16Z

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding.py

+            The Census version tag, e.g., ``"2023-12-15"``.
+
+    Returns:
+        A list of dictionaries, each containing metadata describing an available embedding.


Options:

1) Return a subset of the metadata that contains only the relevant information (name, organism, etc.). The example listed here is for reference only.

2) Return the full metadata.

Strongly prefer 1), plus a verbose argument.
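A minimal sketch of the preferred option, a summary by default with a verbose flag for the full records. The function name, the field names, and the sample record are hypothetical placeholders and are not taken from the actual embeddings manifest.

```python
from typing import Any

# Hypothetical choice of "relevant" fields; the real metadata schema may differ.
SUMMARY_FIELDS = ("embedding_name", "experiment_name", "census_version")


def summarize_embedding_metadata(
    records: list[dict[str, Any]], verbose: bool = False
) -> list[dict[str, Any]]:
    """Return full records when verbose, otherwise only the summary fields."""
    if verbose:
        return [dict(r) for r in records]
    return [{k: r[k] for k in SUMMARY_FIELDS if k in r} for r in records]


# Made-up record used only to exercise the function.
records = [
    {
        "embedding_name": "example_model",
        "experiment_name": "homo_sapiens",
        "census_version": "2023-12-15",
        "n_features": 50,
    }
]
print(summarize_embedding_metadata(records))
print(summarize_embedding_metadata(records, verbose=True))
```

Defaulting to the short form keeps the common call readable, while verbose=True still exposes everything for callers who need it.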

tools/census_contrib/src/census_contrib/metadata.py

tools/census_contrib/embedding_metadata.md

api/python/cellxgene_census/src/cellxgene_census/_get_anndata.py

ebezzi · 2024-03-01T22:31:55Z

api/python/cellxgene_census/src/cellxgene_census/_util.py

@@ -18,3 +19,7 @@ def _uri_join(base: str, url: str) -> str:
        p_url.fragment,
    ]
    return urllib.parse.urlunparse(parts)
+
+def _extract_census_version(census: soma.Collection):


I created a live corpus unit test for this method. This should ensure that this parsing method remains consistent across releases.

this code has some lint - please run (an up-to-date) pre-commit across it

api/python/cellxgene_census/tests/test_get_anndata.py

ebezzi · 2024-03-01T22:46:28Z

api/python/cellxgene_census/src/cellxgene_census/_get_anndata.py

+            for emb in add_obs_embeddings:
+                emb_metadata = get_embedding_metadata_by_name(emb, organism, census_version, "obs_embedding")
+                uri = f"{CENSUS_EMBEDDINGS_LOCATION_BASE_URI}/{census_version}/{emb_metadata['id']}"
+                embedding = get_embedding(census_version, uri, obs_soma_joinids)


Note: this will cause the census object to be re-opened. While this shouldn't be an issue, it will result in an extra call. With some effort I can refactor get_embedding to also accept an existing Census object, but I'm not sure if it's worth it.

IMHO, you should refactor the code to have a (common, shared) function that accepts an already open Census handle

bkmartinjr

No real concerns with functionality, but suggest you do an API review with Pablo and Mike to figure out if the signatures are friendly/comprehensible from a UX perspective

metakuni · 2024-03-08T22:53:02Z

Sorry, my clumsy fingers clicked the wrong button!

api/python/cellxgene_census/src/cellxgene_census/_get_anndata.py

bkmartinjr · 2024-03-15T23:45:08Z

api/python/cellxgene_census/src/cellxgene_census/_get_anndata.py

+
+            if add_obs_embeddings:
+                if obsm_layers and [x for x in add_obs_embeddings if x in obsm_layers]:
+                    raise ValueError(


Given the (high) cost of calling query.to_anndata(), you should do all error checking before the costly ops - i.e., move this kind of stuff to a prologue

Good point.

bkmartinjr · 2024-03-15T23:46:02Z

api/python/cellxgene_census/src/cellxgene_census/_get_anndata.py

+                obs_soma_joinids = query.obs_joinids()
+                for emb in add_obs_embeddings:
+                    emb_metadata = get_embedding_metadata_by_name(emb, experiment_name, census_version, "obs_embedding")
+                    uri = f"{CENSUS_EMBEDDINGS_LOCATION_BASE_URI}/{census_version}/{emb_metadata['id']}"


shouldn't these use urljoin()?
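For reference, a small sketch of why a join helper is safer than string formatting, and of one pitfall in the standard urljoin. BASE_URI and join_uri are hypothetical names used only for illustration; the PR's own helper is _uri_join in _util.py, whose full body is not shown here.

```python
from urllib.parse import urljoin

# Hypothetical base URI, used only for illustration.
BASE_URI = "https://example.org/embeddings"


def join_uri(base: str, *segments: str) -> str:
    """Append path segments to base, one at a time, using urljoin."""
    uri = base
    for segment in segments:
        # Without a trailing slash, urljoin would replace the last path component.
        uri = urljoin(uri.rstrip("/") + "/", segment.strip("/"))
    return uri


print(urljoin(BASE_URI, "2023-12-15"))  # last component of the base is replaced
print(join_uri(BASE_URI, "2023-12-15", "emb-id"))
```

Note that urljoin only resolves relative references for the schemes listed in urllib.parse.uses_relative, so URIs with other schemes may need a custom helper.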

bkmartinjr · 2024-03-15T23:47:56Z

api/python/cellxgene_census/src/cellxgene_census/_get_anndata.py

+            census_directory = get_census_version_directory()
+
+            if add_obs_embeddings:
+                if obsm_layers and [x for x in add_obs_embeddings if x in obsm_layers]:


this error check (name collision) is missing from the var axis. Seems like you need it for both

I removed varm_layers from the arguments, as you suggested, since it is not in the current API. The only way you can request a varm is through add_varm_embeddings

Right - gotcha!

So, I'm thinking about this previous conversation a bit more. Coming around to a slightly different perspective. Given:

- tiledbsoma/somacore support more functionality than is exposed here (e.g., varp_layers, etc)
- Census doesn't currently use these, due to schema definition (not code)
- we want to clearly separate the arg names for clarity (the previous conversation where we decided to remove varm_layers)

I think we should land roughly here, to keep the code & schema modular:

- this function (get_anndata) should pass through all (or at least most) arguments supported by ExperimentAxisQuery.to_anndata. That makes it future-proof and decoupled from the schema
- the newly added args should have clearly separated names (the current names are not clear per above comment)
- the error checking needs to detect collisions because AnnData only has one "dict" to shove them all in, and they might conflict.

Boiling this down, I suggest:

- Add (and pass to query.to_anndata) the args obsm_layers, varm_layers, obsp_layers, and varp_layers. These are just pass through.
- Rename add_obs_embeddings and add_var_embeddings to something clear (see above discussion)
- Do the error checks for collisions, and do them before you do any data loading. Example below.

Example error check for obs:

    if set(obsm_layers) & set(add_obs_embeddings):
        ... there is a collision error ...
    ... do same for varm ...

Sounds good.
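A runnable version of the collision check discussed in this thread, written as a prologue that runs before any data loading. The helper names and the obs_embeddings/var_embeddings parameter names are hypothetical (the final argument names were still under discussion), and the sketch assumes either argument may be None.

```python
from typing import Optional, Sequence


def check_no_collisions(
    layers: Optional[Sequence[str]],
    embeddings: Optional[Sequence[str]],
    axis: str,
) -> None:
    """Raise ValueError if a requested layer and embedding share a name."""
    # Either argument may be None, so fall back to an empty tuple.
    overlap = set(layers or ()) & set(embeddings or ())
    if overlap:
        raise ValueError(
            f"{axis}: names requested as both layer and embedding: {sorted(overlap)}"
        )


def validate_request(
    obsm_layers=None, varm_layers=None, obs_embeddings=None, var_embeddings=None
) -> None:
    """Prologue: run every cheap check before any data is loaded."""
    check_no_collisions(obsm_layers, obs_embeddings, "obsm")
    check_no_collisions(varm_layers, var_embeddings, "varm")


# Distinct names on the obs axis: passes silently.
validate_request(obsm_layers=["emb_a"], obs_embeddings=["emb_b"])
```

Because both axes go through the same helper, the var axis cannot silently miss the check that the obs axis has.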

api/python/cellxgene_census/src/cellxgene_census/_util.py

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding.py

bkmartinjr · 2024-03-16T01:15:20Z

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding.py

+    response = requests.get(CELL_CENSUS_EMBEDDINGS_MANIFEST_URL)
+    response.raise_for_status()
+
+    versions = set()


this whole blob could be a simple comprehension, which might make it more pythonic (this is a nit, up to you). E.g.,

return sorted({ obj['census_version'] for obj in manifest.values() if ... })

And unless there are duplicates expected, I'm not sure what the set adds? If there are duplicates, doesn't that imply you need more filter criteria?

The set is because multiple embeddings can exist for a single alias, and in this case we're only interested in the census version string, so it needs to be deduplicated. I'll rewrite using the comprehension.
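A sketch of the comprehension-based rewrite, showing why the set is needed when several embeddings share a Census version. The manifest contents, the embedding_name field, the filter condition (elided as "..." in the suggestion above), and the function name are all assumptions made for illustration.

```python
# Hypothetical manifest: keys and fields are placeholders for illustration.
manifest = {
    "id-1": {"embedding_name": "emb_a", "census_version": "2023-12-15"},
    "id-2": {"embedding_name": "emb_a", "census_version": "2023-12-15"},
    "id-3": {"embedding_name": "emb_a", "census_version": "2024-01-01"},
    "id-4": {"embedding_name": "emb_b", "census_version": "2023-10-23"},
}


def versions_with_embedding(manifest: dict, embedding_name: str) -> list[str]:
    """Return the sorted, de-duplicated Census versions that have this embedding."""
    return sorted(
        {
            obj["census_version"]
            for obj in manifest.values()
            if obj["embedding_name"] == embedding_name  # assumed filter
        }
    )


print(versions_with_embedding(manifest, "emb_a"))  # ['2023-12-15', '2024-01-01']
```

The set comprehension collapses the two "2023-12-15" entries into one before sorting, which a list comprehension would not do.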

codecov · 2024-03-18T20:11:27Z

Codecov Report

Attention: Patch coverage is 93.26425%, with 13 lines in your changes missing coverage. Please review.

Project coverage is 82.32%. Comparing base (a5dbdef) to head (520941e).
Report is 2 commits behind head on main.

Files	Patch %	Lines
...us/src/cellxgene_census/experimental/_embedding.py	87.09%	8 Missing ⚠️
...hon/cellxgene_census/src/cellxgene_census/_util.py	71.42%	2 Missing ⚠️
...xgene_census/tests/experimental/test_embeddings.py	95.23%	2 Missing ⚠️
...lxgene_census/src/cellxgene_census/_get_anndata.py	96.55%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1023      +/-   ##
==========================================
+ Coverage   81.33%   82.32%   +0.99%     
==========================================
  Files          73       74       +1     
  Lines        5566     5714     +148     
==========================================
+ Hits         4527     4704     +177     
+ Misses       1039     1010      -29

Flag	Coverage Δ
unittests	`82.32% <93.26%> (+0.99%)`	⬆️

pablo-gar

@ebezzi have you considered a convenience API to get all available Census versions for a given embedding name?

api/python/cellxgene_census/src/cellxgene_census/_get_anndata.py

pablo-gar · 2024-03-20T00:04:03Z

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding.py

+            The Census version tag, e.g., ``"2023-12-15"``.
+
+    Returns:
+        A list of dictionaries, each containing metadata describing an available embedding.


strongly prefer 1) and having a verbose argument

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding.py

ebezzi · 2024-03-20T17:06:15Z

@ebezzi have you considered a convenience API to get all available Census versions for a given embedding name?

@pablo-gar get_all_census_versions_with_embedding would be that function.

Draft

d126590

ebezzi changed the title ~~[python] New embeddings API~~ [python] New embeddings API draft Feb 28, 2024

ebezzi added 5 commits February 28, 2024 13:27

Feedback

c221799

Add basic unit test

43534b0

Add basic unit test, pass 2

b0472e3

Checkpoint

ddafdf0

Checkpoint

337b571

ebezzi requested a review from bkmartinjr March 1, 2024 23:46

bkmartinjr reviewed Mar 8, 2024

bkmartinjr requested review from mlin and pablo-gar March 8, 2024 19:47

metakuni closed this Mar 8, 2024

metakuni reopened this Mar 8, 2024

ebezzi added 5 commits March 8, 2024 15:42

lint part 1

77d28a5

Refactor variable

c91b1e8

More work

1f53a94

Remove varm_layers

49ced87

General refactor

299aeb7

ebezzi marked this pull request as ready for review March 15, 2024 16:46

Add condition for obsm_layers

a867985

ebezzi requested a review from bkmartinjr March 15, 2024 20:58

PR comments

4656d89

pablo-gar approved these changes Mar 20, 2024

ebezzi changed the title ~~[python] New embeddings API draft~~ [python] New embeddings API Mar 27, 2024

ebezzi added 2 commits March 29, 2024 10:56

Switch to urijoin

1cbc2bc

Merge from main

520941e

ebezzi merged commit c6cf312 into main Apr 1, 2024
15 checks passed

ebezzi deleted the ebezzi/new-embeddings-api branch April 1, 2024 22:15
